1.0 - Install packages, load libraries

Here we source the bit of code that performs this operation. essentially, this one-line command runs an entire R script file. This can be handy when you load the same libraries in different steps of your analysis

source("./code/install_packages_load_libraries.R")

2.0 - Load and check data

Here we will load data from 3 species in the Brassicacea family experiment: Arabidopsis thaliana, Brassica oleraceae and Isatis tinctoria. This data has been heavily filtered to reduce complexity and size.

# load object
load("./data/ps_rarefied.Rdata")

#check ps object
ps_rarefied # full ps object
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1000 taxa and 143 samples ]
## sample_data() Sample Data:       [ 143 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 1000 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 1000 reference sequences ]
otu_table(ps_rarefied)[1:5,1:5] # 5 first rows across 5 first columns
## OTU Table:          [5 taxa and 5 samples]
##                      taxa are rows
##         49_16S 50_16S 51_16S 52_16S 53_16S
## bASV_5      90    299    109     87    231
## bASV_7      25    156    154    446     58
## bASV_9     157    191    264    108    145
## bASV_11    337    248    193    216    100
## bASV_12    201    437    272     71    154
tax_table(ps_rarefied)[1:5,] # 5 first rows across all columns
## Taxonomy Table:     [5 taxa by 7 taxonomic ranks]:
##         Kingdom       Phylum                Class                   
## bASV_5  "k__Bacteria" "p__Proteobacteria"   "c__Gammaproteobacteria"
## bASV_7  "k__Bacteria" "p__Proteobacteria"   "c__Gammaproteobacteria"
## bASV_9  "k__Bacteria" "p__Bacteroidota"     "c__Bacteroidia"        
## bASV_11 "k__Bacteria" "p__Proteobacteria"   "c__Alphaproteobacteria"
## bASV_12 "k__Bacteria" "p__Actinobacteriota" "c__Actinobacteria"     
##         Order                 Family                 Genus            
## bASV_5  "o__Burkholderiales"  "f__Comamonadaceae"    "g__Acidovorax"  
## bASV_7  "o__Pseudomonadales"  "f__Pseudomonadaceae"  "g__Pseudomonas" 
## bASV_9  "o__Chitinophagales"  "f__Chitinophagaceae"  "g__Niastella"   
## bASV_11 "o__Sphingomonadales" "f__Sphingomonadaceae" "g__Sphingomonas"
## bASV_12 "o__Streptomycetales" "f__Streptomycetaceae" "g__Streptomyces"
##         ASV_id   
## bASV_5  "bASV_5" 
## bASV_7  "bASV_7" 
## bASV_9  "bASV_9" 
## bASV_11 "bASV_11"
## bASV_12 "bASV_12"
ps_rarefied@sam_data [1:5,1:5]# metadata
##        Target Harvest_ID Block_row_column Sp_speed  Stress
## 49_16S    16S         49            1.2.4       M1 Control
## 50_16S    16S         50            2.3.1       M1 Control
## 51_16S    16S         51            3.2.9       M1 Control
## 52_16S    16S         52            4.1.6       M1 Control
## 53_16S    16S         53           5.6.14       M1 Control

3.0 - Create a single taxmap object for a single heat tree

For a simple start, let’s create individual heat trees from an individual phyloseq object

# make a copy of you rarefied ps object
ps_rarefied_filt<-ps_rarefied

#remove unecessary taxonomic info ("ASV_id") by updating the tax table with a subset of the tax table
tax_table(ps_rarefied_filt)<-tax_table(ps_rarefied_filt)[,1:6]


# let's remove the "r__"ranks from the taxonomy, they can be useful but will pollute our plot
tax_table(ps_rarefied_filt)[, colnames(tax_table(ps_rarefied_filt))] <- gsub(pattern = "[a-z]__", # regular expression 
                                                                   x = tax_table(ps_rarefied_filt)[, colnames(tax_table(ps_rarefied_filt))], # "df"
                                                                   replacement = "") # replacement for pattern
# transform from phyloseq to taxmap object
taxmap_obj<-parse_phyloseq(ps_rarefied_filt)

3.1 - EXERCISE check taxmap object you just created

taxmap_obj is a object from the Taxmap class, but can be navigated as a nested list. Explore taxmap_obj with “$” and tab-completion. can you find the representative sequences in the data layer?

# the taxmap object is a data container, just like phyloseq
taxmap_obj

# this object has much more than what you can see on one print, just like much other R output you generate. 
taxmap_obj$data$tax_data # taxa tables...
taxmap_obj$taxa # taxonomy names, ranks and codes

3.2 - Create the heat tree visualization

Now we can create a heat tree, with colors indicating the number of sequences in each taxa.

The first tree is actually quite busy - depending on your goal, you might want to further filter the input ASVs to a smaller subset like all differentially abundant ASVs or features defined as core microbiome communities

# create a heat tree!
set.seed(1) # set a seed so the layout can be exactly the same every time
heat_tree(.input = taxmap_obj, # your taxmap object from the metacoder package
          node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon 
          node_color = n_obs, # this function from metacoder is called with non-standard evaluation (no arguments)
          node_label = taxon_names, # labels for the taxons
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
          output_file = "./results/heat_tree_test.pdf") # A PDF file you will export this plot to

Can you well which is the most diverse phylum? which are the 3 most diverse families?

QUESTION BREAK

3.3 - EXERCISE check arguments

check what’s produced with the n_obs argument/function used in the tree above and also check the taxa names

#n_obs is a function that takesyour taxmap object as input, and return the number of sequences in each taxa
n_obs(taxmap_obj)

#in metacoder, ab, ac, ad, etc refer each to a different taxonomy
taxmap_obj$taxa

3.4 - Remove some taxa the heat tree visualization

It might be too dense/complex to look, so let’s remove taxa names that can be unfamiliar like “phylum NB1-J”

# create a heat cleaner tree!
taxmap_obj %>%
  metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = n_obs,
            layout = "davidson-harel",
            initial_layout = "reingold-tilford",
            output_file = "./results/heat_tree_test_cleaner.pdf")

3.5 - Modify the tree with available options

There are many options to customize these trees, we will only explore a few of them

3.5.1 - Truncate at order level

This tree stops at Order level

# create a heat cleaner tree!
taxmap_obj %>%
  metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
  metacoder::filter_taxa(taxon_ranks == "Order", supertaxa = TRUE) %>% # subset to the Order rank
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = n_obs,
            layout = "davidson-harel",
            initial_layout = "reingold-tilford",
            output_file = "./results/heat_tree_order.pdf")

3.5.2 - make individual roots for specific taxonomic groups

This tree has a separated root at different taxonomic levels, and custom colors

# create a heat cleaner tree!
taxmap_obj %>%
  metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
  metacoder::filter_taxa(taxon_names %in% c("Rhizobiales", "Actinobacteria", "Xanthomonadaceae"),
              subtaxa = TRUE, 
              supertaxa = FALSE) %>% 
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = n_obs,
            node_color_range = c("blue", "yellow", "pink", "red"),
            layout = "davidson-harel",
            initial_layout = "reingold-tilford",
            output_file = "./results/heat_tree_sep_roots.pdf")

3.6 - Compare 2 groups of samples

lets now compare ASV abundances between two groups of samples, such as plant species in brassicaceae lineage I and II (Sp_Lineage_Walden).

3.6.1 - calculate taxon abundance

“tax_abund” is added as a data layer of the taxmap object. it defines the number of sequences on every single taxon - all phyla, all classes, orders… this is different from counts of every single ASV! this data layer is needed on the next step.

#get abundance per taxon
taxmap_obj$data$tax_abund<-calc_taxon_abund(obj = taxmap_obj, # the taxmap object
                                            data = "otu_table", # layer of data table to use
                                            cols = taxmap_obj$data$sample_data$sample_id) #sample names

#check the new data layer you created. note that "ab" refers to "k__Bacteria",  and "aac" to "p__Proteobacteria".  
taxmap_obj$data$tax_abund
## # A tibble: 457 x 144
##    taxon_id `49_16S` `50_16S` `51_16S` `52_16S` `53_16S` 54_16~1 55_16~2 56_16~3
##    <chr>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
##  1 ab          10000    10000    10000    10000    10000   10000   10000   10000
##  2 ac           7137     7430     6677     7995     7774    7459    7183    7574
##  3 ad           1188     1012     1351      739      769     951    1242     998
##  4 ae            739     1152     1129      295      535     938    1085     829
##  5 af            355      101      349      363      267     186     135     165
##  6 ag            191       82      189      309      286     208     134     147
##  7 ah             42       65       80       22       40      47      28      81
##  8 ai             23       24       43       38       40      21      13      27
##  9 aj             10        7       18       31       12       5       6       0
## 10 ak             16       18       21       13       25      19      27      17
## # ... with 447 more rows, 135 more variables: `57_16S` <dbl>, `58_16S` <dbl>,
## #   `59_16S` <dbl>, `60_16S` <dbl>, `61_16S` <dbl>, `62_16S` <dbl>,
## #   `63_16S` <dbl>, `64_16S` <dbl>, `65_16S` <dbl>, `66_16S` <dbl>,
## #   `67_16S` <dbl>, `68_16S` <dbl>, `69_16S` <dbl>, `70_16S` <dbl>,
## #   `71_16S` <dbl>, `72_16S` <dbl>, `73_16S` <dbl>, `74_16S` <dbl>,
## #   `75_16S` <dbl>, `76_16S` <dbl>, `77_16S` <dbl>, `78_16S` <dbl>,
## #   `79_16S` <dbl>, `80_16S` <dbl>, `81_16S` <dbl>, `82_16S` <dbl>, ...
#quickly compare those values to the OTU table to see the differnce:
taxmap_obj$data$otu_table
## # A tibble: 1,000 x 145
##    taxon_id otu_id  `49_16S` `50_16S` `51_16S` `52_16S` `53_16S` 54_16~1 55_16~2
##    <chr>    <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
##  1 jq       bASV_5        90      299      109       87      231      40      78
##  2 jr       bASV_7        25      156      154      446       58     191     376
##  3 js       bASV_9       157      191      264      108      145     128     245
##  4 jt       bASV_11      337      248      193      216      100     499     403
##  5 ju       bASV_12      201      437      272       71      154     240     362
##  6 jv       bASV_13      226      178       89      162      115     255     230
##  7 jw       bASV_14      374      159      301      230      321     167     229
##  8 fi       bASV_15      157       49       72      135       26     199      92
##  9 jy       bASV_16       62      373      173       96      144     255     201
## 10 fj       bASV_18      317      160      161      143      191     165     182
## # ... with 990 more rows, 136 more variables: `56_16S` <dbl>, `57_16S` <dbl>,
## #   `58_16S` <dbl>, `59_16S` <dbl>, `60_16S` <dbl>, `61_16S` <dbl>,
## #   `62_16S` <dbl>, `63_16S` <dbl>, `64_16S` <dbl>, `65_16S` <dbl>,
## #   `66_16S` <dbl>, `67_16S` <dbl>, `68_16S` <dbl>, `69_16S` <dbl>,
## #   `70_16S` <dbl>, `71_16S` <dbl>, `72_16S` <dbl>, `73_16S` <dbl>,
## #   `74_16S` <dbl>, `75_16S` <dbl>, `76_16S` <dbl>, `77_16S` <dbl>,
## #   `78_16S` <dbl>, `79_16S` <dbl>, `80_16S` <dbl>, `81_16S` <dbl>, ...

3.6.2 - Calculate taxon prevalence/occumpancy

here we define taxa “occurrence” by determining in how many samples of each group they appear

#get occurrence of per lineage
taxmap_obj$data$tax_occ_2lineages <- calc_n_samples(obj = taxmap_obj, 
                                                    data = "tax_abund", 
                                                    cols = taxmap_obj$data$sample_data$sample_id,
                                                    groups = taxmap_obj$data$sample_data$Sp_Lineage_Walden) # What category each sample is



#check the object. it shows how many times a taxon is represented in lineage I or II
taxmap_obj$data$tax_occ_2lineages
## # A tibble: 457 x 3
##    taxon_id lineage_II lineage_I
##    <chr>         <int>     <int>
##  1 ab               95        48
##  2 ac               95        48
##  3 ad               95        48
##  4 ae               95        48
##  5 af               95        48
##  6 ag               95        48
##  7 ah               95        48
##  8 ai               95        48
##  9 aj               94        40
## 10 ak               95        48
## # ... with 447 more rows

3.6.3 - Calculate log fold changes between two groups

Here we see log fold changes

# this will calculate log2 median ratios and p values for a wilcoxcon test within taxas in this lineage
taxmap_obj$data$diff_table_2lineages <- compare_groups(obj = taxmap_obj,
                                                      data = "tax_abund",
                                                      cols = taxmap_obj$data$sample_data$sample_id, 
                                                      groups = taxmap_obj$data$sample_data$Sp_Lineage_Walden) 

#check the object.
taxmap_obj$data$diff_table_2lineages
## # A tibble: 457 x 7
##    taxon_id treatment_1 treatment_2 log2_median_ratio media~1 mean_d~2 wilcox_~3
##    <chr>    <chr>       <chr>                   <dbl>   <dbl>    <dbl>     <dbl>
##  1 ab       lineage_II  lineage_I              0          0    0       NaN      
##  2 ac       lineage_II  lineage_I              0.126    624.   5.56e+2   1.34e-7
##  3 ad       lineage_II  lineage_I              0.0991    68   -4.57e+0   4.77e-1
##  4 ae       lineage_II  lineage_I             -0.795   -394.  -3.58e+2   5.75e-7
##  5 af       lineage_II  lineage_I              0.0316     5.5 -1.54e-2   9.76e-1
##  6 ag       lineage_II  lineage_I             -0.448    -65.5 -6.67e+1   1.77e-5
##  7 ah       lineage_II  lineage_I             -0.257    -11.5 -1.05e+1   1.93e-1
##  8 ai       lineage_II  lineage_I             -0.665    -20.5 -2.11e+1   1.10e-6
##  9 aj       lineage_II  lineage_I             -0.115     -1   -7.21e-1   7.78e-1
## 10 ak       lineage_II  lineage_I             -0.445     -6.5 -6.32e+0   1.23e-3
## # ... with 447 more rows, and abbreviated variable names 1: median_diff,
## #   2: mean_diff, 3: wilcox_p_value

NOTE that the “func” argument of the compare_groups() function allows you to use a custom function to calculate the significant of differential abudance tests. I have not tried to change the standard values, but you might be able to implement zimbwave in here.

3.6.4 - Plot heat tree showing log-fold differences

tan indicates lineage II, cyan indicates lineage I

taxmap_obj %>%
  metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = log2_median_ratio,
            node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
            node_color_range = c("cyan", "gray", "tan"), # The color palette used
            layout = "davidson-harel",
            node_color_axis_label = "Log2 ratio, cyan for Lineage I",
            initial_layout = "reingold-tilford",
            output_file = "./results/heat_tree_diff_lineages.pdf")

3.6.5 - Supress non-significant differences and plot log changes again

This plot indicates log fold changes, but includes p values > 0.05. let’s run any non-significant differences into a median log ratio of zero, shading them as grey in the plot

# set differential log ratio to 0 based on p values below 0.05
taxmap_obj$data$diff_table_2lineages$log2_median_ratio[taxmap_obj$data$diff_table_2lineages$wilcox_p_value > 0.05] <- 0

# plot the heatmap
taxmap_obj %>%
  metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = log2_median_ratio,
            node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
            node_color_range = c("cyan", "gray", "tan"), # The color palette used
            layout = "davidson-harel",
            node_color_axis_label = "Log2 ratio, cyan for Lineage I",
            initial_layout = "reingold-tilford",
            output_file = "./results/heat_tree_diff_lineages_pcut.pdf")

From this figure, we could see a larger presence of low-diversity phyla in lineage I (cyan). lineage II has a clear increase in Sphingomonadaceae. both lineages seem to be colonized by several, but different, bacteroidia and actinobacteria. Family Comamonadaceae and Chitinophagaceae seem to be the most diverse

QUESTION BREAK

3.7 - Compare 3 or more groups of samples

If you have more than 2 groups, like we have 3 different stresses, you can plot a matrix of heat trees with a dedicated function from the same package. Before we do that, let’s first calculate taxon abundance, prevalence, and log-fold changes like we did in chunks 3.5.1 to 3.5.3

3.7.1 - Calculate abudance, prevalence and log-fold in 3 treatment groups

#abundance per taxon was already calculated in chunk 3.5.1, now we just overwrite
taxmap_obj$data$tax_abund<-calc_taxon_abund(obj = taxmap_obj, 
                                            data = "otu_table", 
                                            cols = taxmap_obj$data$sample_data$sample_id) 

#get occurrence of ASVs per treatment
taxmap_obj$data$tax_occ_3treatments <- calc_n_samples(obj = taxmap_obj, 
                                                      data = "tax_abund", 
                                                      cols = taxmap_obj$data$sample_data$sample_id,
                                                      groups = taxmap_obj$data$sample_data$Stress) # now refer to Stress, that has 3 groups

# calculate log2 median ratios and p values for a wilcoxcon test within taxas in this stress treatment groups
taxmap_obj$data$diff_table_3treatments <- compare_groups(obj = taxmap_obj,
                                                        data = "tax_abund",
                                                        cols = taxmap_obj$data$sample_data$sample_id, 
                                                        groups = taxmap_obj$data$sample_data$Stress) 

# set differential log ratio to 0 based on adjusted p values
taxmap_obj$data$diff_table_3treatments$log2_median_ratio[taxmap_obj$data$diff_table_3treatments$wilcox_p_value > 0.05] <- 0

# because of potential ambiguity in non-standard evaluation, we should change this column name. this is only needed because of the column taxmap_obj$data$diff_table_2lineages$log2_median_ratio we created in chunk 3.5.3 
names(taxmap_obj$data$diff_table_3treatments)[4]<-"log2_median_ratio_stress"

3.7.2 - Plot a matrix of heat trees

Now that we have log-fold differences in pairwise treatment comparisons, let’s compare them in a matrix of heat trees

#plot matrix tree
set.seed(1)
heat_tree_matrix(taxmap_obj,
                 data = "diff_table_3treatments", # this is the table with the data you want to plot
                 node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
                 node_label = taxon_names,
                 node_color = log2_median_ratio_stress, # A column from `taxmap_obj$data$diff_table_3treatments`
                 node_color_range = diverging_palette(), # The built-in palette for diverging data
                 node_color_trans = "linear", # The default is scaled by circle area
                 node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                 edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                 node_size_axis_label = "Number of OTUs",
                 node_color_axis_label = "Log2 ratio median proportions",
                 layout = "davidson-harel", # The primary layout algorithm
                 initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
                 output_file = "./results/matrix_heat_tree_Stress.pdf") # Saves the plot as a pdf file

3.7.3 - EXERCISE change grouping variable

by changing the code above, Make a matrix of heat trees that plots the differences between the 3 plant species we are using. The metadata for that is present in the metadata column “sp_full_name”. Which alphaproteobacteria are more common in Brassica oleracea than Isatis tinctoria?

QUESTION BREAK

3.8 - Create a matrix of heat trees from a phyloseq object with a single function

These trees are interesting, but as you start exploring real data and multiple options you will need a more efficiency way of running all this code. For this, we create a custom function that handles all of it at once,

3.8.1 - Define a custom function

This custom function will:

1 - truncate the taxonomy table to genus level,
2 - remove the g__ and f__ etc from taxonomy table 3 - parse the phyloseq object into a taxmap object, 4 - calculate taxa occurrence, abundance and log-fold differences between groups of samples that you define 5 - plot a heat tree

It’s arguments are 1 - ps_object = a phyloseq object 2 - sample_group = a (quoted) column from the sample metadata that you want to compare

The function will count the number of different groups in your metadata column to decide if it will plot a heat tree (2 treatments/groups) or matrix of heat trees (3+ treatments). the function should fail if your metadata only has 1 sample group!

phyloseq_to_heat_tree_matrix<-function(ps_object, sample_group){
  
  # this function output is a heat trees comparing metadata. the input is based on a phyloseq object
  
  # ps object =  a phyloseq object, containing sample metadata, otu table, and taxonomy table
  # sample_group = the name of a column in your emtadata that you want to compare in the heat tree. it has to be quoted.
  
  # this function will create a matrix of heat trees if your metadata has more than 2 groups. it should fail if it has only 1 group
  

#remove unecessary taxonomic info (dada2id, "s__" and "ASV_id) by updating the tax table with a subset of the tax table
tax_table(ps_object)<-tax_table(ps_object)[,1:6]


# let's remove the "r__"ranks from the taxonomy, they can be useful but will pollute our plot
tax_table(ps_object)[, colnames(tax_table(ps_object))] <- gsub(pattern = "[a-z]__", # regular expression pattern to search for
                                                                   x = tax_table(ps_object)[, colnames(tax_table(ps_object))], # "df"
                                                                   replacement = "") # replacement for pattern
# transform from phyloseq to taxmap object
taxmap_obj<-parse_phyloseq(ps_object)

#get abundance per taxon
taxmap_obj$data$tax_abund<-calc_taxon_abund(obj = taxmap_obj, 
                                      data = "otu_table",
                                      cols = taxmap_obj$data$sample_data$sample_id) 

#get occurrence of ASVs per treatment
# the sample groups needs some wrangling to fit within the soft code of the function
taxmap_obj$data$tax_occ<- calc_n_samples(obj = taxmap_obj, 
                                                      data = "tax_abund", 
                                                      cols = taxmap_obj$data$sample_data$sample_id,
                                                      groups = taxmap_obj$data$sample_data[colnames(taxmap_obj$data$sample_data)==sample_group][[1]]) 

# calculate log2 median ratios and p values for a wilcoxcon test within taxas in this stress treatment groups
# the sample groups needs some wrangling to fit within the soft code of the function
taxmap_obj$data$diff_table <- compare_groups(obj = taxmap_obj,
                                                        data = "tax_abund",
                                                        cols = taxmap_obj$data$sample_data$sample_id, 
                                                        groups = taxmap_obj$data$sample_data[colnames(taxmap_obj$data$sample_data)==sample_group][[1]]) 

# set differential log ratio to 0 based on adjusted p values
taxmap_obj$data$diff_table$log2_median_ratio[taxmap_obj$data$diff_table$wilcox_p_value > 0.05] <- 0

# define number of compared factors
factors_compared<-taxmap_obj$data$sample_data[colnames(taxmap_obj$data$sample_data)==sample_group][[1]] 

# draw the plot based on an if else statement: if there are 2 groups, plot a a heat tree comparing abundances between both groups, else plot a matrix of ehat trees. this function will fail if you only have 1 sample group! 

if (length(unique(factors_compared)) == 2) {

set.seed(1)
 output<- taxmap_obj %>%
heat_tree(
            node_label = taxon_names,
      #      data = "diff_table", 
            node_size = n_obs,
            node_color = log2_median_ratio,
            node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
            node_color_range = c("cyan", "gray", "tan"), # The color palette used
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")

 } else {

set.seed(1)
output<-heat_tree_matrix(taxmap_obj,
                         data = "diff_table", # this is the table with the data you want to plot
                         node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
                         node_label = taxon_names,
                         node_color = log2_median_ratio, # A column from `taxmap_obj$data$diff_table_3treatments`
                         node_color_range = diverging_palette(), # The built-in palette for diverging data
                         node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                         edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                         node_size_axis_label = "Number of OTUs",
                         node_color_axis_label = "Log2 ratio median proportions",
                         layout = "davidson-harel", # The primary layout algorithm
                         initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations

   
   
 }


# clearly define the output object you will get from the function
return(output)   


}

3.8.2 - Creating heat trees from a custom function

Now that we defined a new function that takes a phyloseq object as a data input and a column of the metadata as an argument, we can more easily produce heat trees across the variables.

# run custom function
heat_tree_stress<-phyloseq_to_heat_tree_matrix(ps_object = ps_rarefied, sample_group = "Stress")
heat_tree_lineage<-phyloseq_to_heat_tree_matrix(ps_object = ps_rarefied, sample_group = "Sp_Lineage_Walden")

# check stress tree
heat_tree_stress

# check lineage tree
heat_tree_lineage

3.8.3 - EXERCISE make more heat trees!

Now that you know the basics, let’s manipulate some of these visualizations

3.8.3.1 - EXERCISE use custom function

Now produce new heat tree, based on metadata like “Sp_full_name”, “greenhouse_compartment” or “Speed”!

3.8.3.2 - EXERCISE see Burkholderiales and Rhizobiales at ASV level

Our trees so far stopped at genus level, because ploting 2000 ASVs would overcrowd the plot. Show the “Sp_full_name” effect at ASV level, but only on Burkholderiales and Rhizobiales. What do you have to change, and where? You can copy-paste and adjust the code we saw above, or make a new version of the custom function

3.8.3.3 - EXERCISE remove kingdom from taxonomy table

Our trees so far have k__Bacteria as a shared root. What will happen if you remove that taxonomic information? to make this change, What do you have to alter? You can copy-paste and adjust the code we saw above, or make a new version of the custom function

4.0 - Create a list of taxmap objects for a single heat tree

Now that we have a function that takes a single phyloseq object as input, let’s run it across all the species in the dataset

4.1 - Create a list of phyloseq objects

First, let’s split the phyloseq object into a list of phyloseq objects according the species

note that the phyloseq input is based on the original phyloseq object, that still contains the ranks as p__Proteobacteria, o__Rhizobiales, etc

# turn a ps object into a list f ps objects
ps_l<- metagMisc::phyloseq_sep_variable(ps_rarefied, variable = "Sp_abb_name")

4.2 - Check list of phyloseq objects

Let’s check the list we just created. there are different ways we can do this

# check object, which is a list of phyloseq objects (ps_l)
ps_l
## $At
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 963 taxa and 48 samples ]
## sample_data() Sample Data:       [ 48 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 963 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 963 reference sequences ]
## 
## $Bo_M
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 994 taxa and 48 samples ]
## sample_data() Sample Data:       [ 48 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 994 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 994 reference sequences ]
## 
## $It
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 988 taxa and 47 samples ]
## sample_data() Sample Data:       [ 47 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 988 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 988 reference sequences ]
# access the Arabidosps phyloseq object
ps_l$At
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 963 taxa and 48 samples ]
## sample_data() Sample Data:       [ 48 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 963 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 963 reference sequences ]
# access the taxonomy table of the  Arabidosps phyloseq object
ps_l$At@tax_table[1:5,]
## Taxonomy Table:     [5 taxa by 7 taxonomic ranks]:
##         Kingdom       Phylum                Class                   
## bASV_5  "k__Bacteria" "p__Proteobacteria"   "c__Gammaproteobacteria"
## bASV_7  "k__Bacteria" "p__Proteobacteria"   "c__Gammaproteobacteria"
## bASV_9  "k__Bacteria" "p__Bacteroidota"     "c__Bacteroidia"        
## bASV_11 "k__Bacteria" "p__Proteobacteria"   "c__Alphaproteobacteria"
## bASV_12 "k__Bacteria" "p__Actinobacteriota" "c__Actinobacteria"     
##         Order                 Family                 Genus            
## bASV_5  "o__Burkholderiales"  "f__Comamonadaceae"    "g__Acidovorax"  
## bASV_7  "o__Pseudomonadales"  "f__Pseudomonadaceae"  "g__Pseudomonas" 
## bASV_9  "o__Chitinophagales"  "f__Chitinophagaceae"  "g__Niastella"   
## bASV_11 "o__Sphingomonadales" "f__Sphingomonadaceae" "g__Sphingomonas"
## bASV_12 "o__Streptomycetales" "f__Streptomycetaceae" "g__Streptomyces"
##         ASV_id   
## bASV_5  "bASV_5" 
## bASV_7  "bASV_7" 
## bASV_9  "bASV_9" 
## bASV_11 "bASV_11"
## bASV_12 "bASV_12"
# check metadata of the second list object
sample_data(object = ps_l[[2]])[1:5,1:5]
##        Target Harvest_ID Block_row_column Sp_speed  Stress
## 49_16S    16S         49            1.2.4       M1 Control
## 50_16S    16S         50            2.3.1       M1 Control
## 51_16S    16S         51            3.2.9       M1 Control
## 52_16S    16S         52            4.1.6       M1 Control
## 53_16S    16S         53           5.6.14       M1 Control

4.4 - Run custom function on list of phyloseq objects

lapply (list apply) will apply a function over a list. as we have a list of phyloseq objects now, it will be easy to produce several matrices of heat trees. when working on lists of 33 species we learned that 33 plots are hardly helpful. still, the code and the concept (make a custom function that uses a phyloseq object as input, then run over a list with lapply) can be very useful on your own analysis

# are you new to lapply? see here how it works. you first provide a list ("ps_l"), and then a function ("ntaxa") you will run independently on each element x of your list(ps_l$At, ps_l$Bo_M, ps_l$It). here we simply count the number of taxa of each of the 3 phyloseq objects in the list
lapply(ps_l, function(x)  
  ntaxa(x))
## $At
## [1] 963
## 
## $Bo_M
## [1] 994
## 
## $It
## [1] 988
# run the custom function over a list of phyloseq objects... it can take a moment
heat_tree_stress_l<-lapply(ps_l, function(x)  
  phyloseq_to_heat_tree_matrix (ps_object = x, sample_group = "Stress"))

QUESTION BREAK

5.0 - Adding external data to the heat trees

What if we want to visualize other types of taxa-associated data, like p values from deseq2, importance in random forest predictions, distances in an ordination, or degree in a network? You can do that, as long as the vector of values you use as node_size or node_colour have the same length as the number of taxon names you are plotting. You can for example add another data layer to the taxmap object, or provide data from an independent df as the argument of node_size or node_colour.

5.1 - Check some input arguments of a heat tree

To explore this, lets first check some of the taxmap object structure, then create some random data and add it to the heat tree

# let's review one of the more basic trees.
taxmap_obj %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = n_obs,
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")

#let's review what is actually used as input for node size and color
n_obs(taxmap_obj)
##   ab   ac   ad   ae   af   ag   ah   ai   aj   ak   al   am   an   ap   aq   ar 
## 1000  574  102   84   79   59   15   11    3    3   18   21    8   14    4    1 
##   as   at   au   av   aw   ax   ay   az   ba   bb   bc   bd   be   bf   bg   bh 
##    1    1  359  102  215   64   23   57   13   11   24    3    3   18   18    8 
##   bj   bk   bl   bm   bn   bo   bp   bq   br   bs   bt   bu   bv   bw   bx   by 
##    8    4    2    4    2    5    2   14   24    3    1    1    3    4    1    1 
##   bz   ca   cb   cc   cd   ce   cf   cg   ch   ci   cj   ck   cl   cm   cn   co 
##    1  272    8   53   77    9   82   32   10   62   15    8   57   11   21    1 
##   cp   cq   cr   cs   ct   cu   cv   cw   cx   cy   cz   da   db   dc   de   df 
##    5   17   17    1    6    1    9    3    3    6    8    5   18    8    8   14 
##   dg   dh   di   dj   dk   dl   dm   dn   do   dp   dq   dr   ds   dt   du   dv 
##    2    4    3    2    4    8    2    2    4    2    5   22    8    5    5    2 
##   dw   dx   dy   dz   ea   eb   ec   ed   ee   ef   eg   eh   ei   ej   ek   el 
##    7    1    3    9    6    3    3    2    1    1    1    2    5    1    1    2 
##   em   en   eo   ep   eq   er   es   et   eu   ew   ex   ey   ez   fa   fc   fd 
##    3    1    2    3    1    2    1    1    1    1    1    1    1    1  100    8 
##   fe   ff   fg   fh   fi   fj   fk   fl   fm   fn   fo   fp   fq   fr   fs   ft 
##   52   77    9   25  111   29   31    4    3   44    7    9   18    4    6   57 
##   fu   fv   fw   fx   fy   fz   ga   gb   gc   gd   ge   gf   gg   gh   gi   gj 
##   11   21    1    5   15    4   17    1    6    1    9    3    3    6    8   14 
##   gk   gl   gm   gn   gp   gq   gr   gs   gt   gu   gv   gw   gx   gy   gz   ha 
##    5   16    4    7    4    8   14   19    2    1    3    2    4    4    4    5 
##   hb   hc   hd   he   hf   hg   hh   hi   hj   hk   hl   hm   hn   ho   hp   hq 
##    8    2   10    3    2    2    2    2    2    8    4    2    8    3    5    2 
##   hr   hs   ht   hu   hv   hw   hx   hy   hz   ia   ib   ic   id   ie   if   ig 
##   14    1    1    1    3    2    9    6    3    3    1    1    3    2    1    1 
##   ih   ii   ij   ik   il   im   in   io   ip   iq   ir   it   iu   iv   iw   ix 
##    1    2    5    1    1    3    4    2    2    3    3    3    1    2    3    1 
##   iy   iz   ja   jb   jc   jd   je   jf   jg   ji   jj   jk   jl   jm   jn   jp 
##    1    2    1    1    2    1    1    1    1    1    1    1    1    1    1    1 
##   jq   jr   js   jt   ju   jv   jw   jy   ka   kb   kc   kh   kj   kk   km   kn 
##    6    8   30   36    8   11   13   67    5    1    1    4    3    2    5    4 
##   ko   kp   kq   kr   ks   kt   ku   kv   kw   kx   ky   kz   la   lb   lc   ld 
##    4   10    9    2    3    4   22    3    4    6    7    1    2    9    1    8 
##   le   lf   lg   lh   li   lj   lk   ll   lm   ln   lo   lp   lq   lr   ls   lt 
##    4   24   27    2    2    1    1    2    5   12    3    5    3    4    9    1 
##   lu   lw   lx   ly   lz   ma   mb   mc   me   mf   mg   mh   mi   mj   mk   ml 
##    1    1    9    3    3    3    3    4    6    6   10    1    2    5   16    4 
##   mm   mn   mo   mp   mq   ms   mt   mv   mw   my   mz   na   nb   nc   nd   ne 
##    2    3    2    6    2    2    8   19    3    1    1    9    3    2    7    2 
##   nf   ng   nh   ni   nj   nk   nl   nm   nn   no   np   nq   nr   ns   nt   nu 
##    3    3    3    5    4    4    2    3    8    1    3    1    4    9    3    2 
##   nv   nw   nx   ny   nz   oa   ob   oc   od   oe   of   og   oh   oi   oj   ok 
##    2    2   10    3    1    2    2    1    4   12   10    1    1    2    2    1 
##   ol   om   on   oo   op   oq   or   os   ot   ou   ov   ow   ox   oy   oz   pa 
##    6    1    2    4    1    1    7    3    2    5    2   14    2    3    1    1 
##   pb   pc   pd   pe   pf   pg   ph   pi   pj   pk   pl   pm   pn   po   pp   pq 
##    1    1    1    3    1    2    9    4    1    1    6    3    1    3    1    1 
##   pr   ps   pt   pv   pw   py   pz   qb   qc   qd   qe   qf   qg   qh   qj   qk 
##    1    1    3    5    2    1    1    1    1    2    3    1    1    3    2    3 
##   ql   qm   qn   qo   qp   qq   qr   qs   qt   qu   qv   qx   qy   qz   ra   rb 
##    1    1    2    2    2    2    4    2    3    1    1    1    1    1    2    5 
##   rc   re   rf   rg   rh   ri   rj   rk   rl   rn   ro   rp   rq   rr   rs   rt 
##    2    3    1    1    1    1    1    1    2    1    2    1    1    1    1    1 
##   ru   rv   rw   rx   ry   rz   sa   sb   sc   sd   se   sf   sg   sh   sj   sk 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1    1 
##   sl   sm   sn   so   sp   sq   sr   st   su 
##    1    1    1    1    1    1    1    1    1
# ab, es, ff etc all refer to specific nodes int he tree taxonomic groups. the number referts to how many sequences belong to that taxonomy (all 1000 sequences, are part of Bacteria "ab", 574 are Proteobacteria "ac")
taxmap_obj$taxa[1:10]
## $ab
## <Taxon>
##   name: Bacteria
##   rank: Kingdom
##   id: none
##   authority: none
## 
## $ac
## <Taxon>
##   name: Proteobacteria
##   rank: Phylum
##   id: none
##   authority: none
## 
## $ad
## <Taxon>
##   name: Bacteroidota
##   rank: Phylum
##   id: none
##   authority: none
## 
## $ae
## <Taxon>
##   name: Actinobacteriota
##   rank: Phylum
##   id: none
##   authority: none
## 
## $af
## <Taxon>
##   name: Acidobacteriota
##   rank: Phylum
##   id: none
##   authority: none
## 
## $ag
## <Taxon>
##   name: Gemmatimonadota
##   rank: Phylum
##   id: none
##   authority: none
## 
## $ah
## <Taxon>
##   name: Myxococcota
##   rank: Phylum
##   id: none
##   authority: none
## 
## $ai
## <Taxon>
##   name: Firmicutes
##   rank: Phylum
##   id: none
##   authority: none
## 
## $aj
## <Taxon>
##   name: Methylomirabilota
##   rank: Phylum
##   id: none
##   authority: none
## 
## $ak
## <Taxon>
##   name: Armatimonadota
##   rank: Phylum
##   id: none
##   authority: none

So if you have a vector from the same length of n_obs(taxmap_obj) with data from other analysis, you can visualize other outputs in a heat tree! lets prepare a df that will add new, random data

5.2 - Extract taxon information from the taxmap object

let’s extract the taxon ids from the taxmap object

#extract the names of the taxonomies in the heat tree, accoding the ab ac etc order
taxon_id_metacoder<-lapply(taxmap_obj$taxa, function (x)
                                                    x$get_name())%>%
                                                    map(1)

# now turn that list into a df
taxon_id_metacoder<-do.call(rbind.data.frame, map(taxon_id_metacoder,1))

# and change column name
colnames(taxon_id_metacoder)<-"taxa_id"

#check output
head(taxon_id_metacoder)
##            taxa_id
## 1         Bacteria
## 2   Proteobacteria
## 3     Bacteroidota
## 4 Actinobacteriota
## 5  Acidobacteriota
## 6  Gemmatimonadota

5.3 - Check some input arguments of a heat tree

then create new (random) data, and add it to the taxon information you extracted

#make up some random data called new_data1
set.seed(1)
new_data1<-c(runif(n=100, min=1, max=10),
            runif(n=200, min=5, max=15),
            runif(n=100, min=10, max=20),
            runif(n=57, min=20, max=25))

#make up some random data called new_data2
set.seed(1)
new_data2<-c(runif(n=57, min=1000, max=2000),
             runif(n=200, min=600, max=1000),
             runif(n=200, min=200, max=500))


# add new (random example) data into the df
taxon_id_metacoder$new_analysis_output1<-new_data1
taxon_id_metacoder$new_analysis_output2<-new_data2

#this is the df we created using data external from metacoder
taxon_id_metacoder[1:10,]
##              taxa_id new_analysis_output1 new_analysis_output2
## 1           Bacteria             3.389578             1265.509
## 2     Proteobacteria             4.349115             1372.124
## 3       Bacteroidota             6.155680             1572.853
## 4   Actinobacteriota             9.173870             1908.208
## 5    Acidobacteriota             2.815137             1201.682
## 6    Gemmatimonadota             9.085507             1898.390
## 7        Myxococcota             9.502077             1944.675
## 8         Firmicutes             6.947180             1660.798
## 9  Methylomirabilota             6.662026             1629.114
## 10    Armatimonadota             1.556076             1061.786

5.4 - Make a new heat tree with the new random data

the size of the node will relate to the first column of the new dataframe. the color of the node will related the the second column of the new dataframe

in a real case, you would have your valuable data, taxon by taxon, in a dataframe with similar structure

# add the new data into the heat tree!
taxmap_obj %>%
  heat_tree(node_label = taxon_names,
            node_size = taxon_id_metacoder$new_analysis_output1,
            node_color = taxon_id_metacoder$new_analysis_output2,
            node_size_axis_label = "Size: new data 2",
            node_color_axis_label = "Color: new data 1",
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")

Of course, there are countless ways you can integrate different pieces of data together - such as by matching the taxonomies ab, ac, ad etc; the names of the taxa like “f__Burkholderiales” etc. Note that you won’t find sample code for this (adding external data) in the metacoder package documentation.

QUESTION BREAK

6.0 - adding results of a fisher tests to a heat tree

In the MeJA applications pilot were we tested different MeJA concentrations, we run different analysis that highlight taxa as “important”: predictors for stress in random forest, keystone nodes in networks, and differentially abundant ASVs in treatment-control comparisons. to select which of the taxonomies of these “important” taxa to focus on, we performed a fisher test to check if the proportions of a taxonomy are similar between the important ASVs and the ASVs that were not defined as important.

The paper with further details should be submitted for publication this by Nov/Dez 2022 ~ you can cite that paper in the near future to use this code

6.1 - Load important taxa names, create new ps object

These ASVs listed in “imp_asv_list.Rdata” were highlighted as differential abundant due to stress treatments OR important in random forest predictions for stress treatments OR network keystone/connector/hub of the species in a more complete dataset (with 70k ASVs). I call these the “important”ASVs.

After we load these ASVs names, we create a list of new ps objects by running phyloseq::prune_taxa (a function that selects taxa in a ps object according a character vector) with base::mapply (a function that runs a function on two lists simultaneously)

# load a list of ASV names. 
load( file="./data/imp_asv_list.Rdata")

imp_asv_list # as you see, this is just a list of taxa
## $At
##   [1] "bASV_9"     "bASV_29"    "bASV_71"    "bASV_97"    "bASV_108"  
##   [6] "bASV_155"   "bASV_169"   "bASV_175"   "bASV_314"   "bASV_391"  
##  [11] "bASV_443"   "bASV_674"   "bASV_776"   "bASV_1058"  "bASV_1085" 
##  [16] "bASV_1526"  "bASV_1722"  "bASV_2147"  "bASV_4416"  "bASV_1008" 
##  [21] "bASV_1035"  "bASV_105"   "bASV_1054"  "bASV_106"   "bASV_111"  
##  [26] "bASV_115"   "bASV_117"   "bASV_119"   "bASV_12"    "bASV_120"  
##  [31] "bASV_1244"  "bASV_13"    "bASV_132"   "bASV_133"   "bASV_134"  
##  [36] "bASV_138"   "bASV_1400"  "bASV_1421"  "bASV_143"   "bASV_154"  
##  [41] "bASV_158"   "bASV_162"   "bASV_165"   "bASV_166"   "bASV_190"  
##  [46] "bASV_192"   "bASV_200"   "bASV_202"   "bASV_205"   "bASV_21"   
##  [51] "bASV_218"   "bASV_224"   "bASV_226"   "bASV_233"   "bASV_234"  
##  [56] "bASV_254"   "bASV_274"   "bASV_279"   "bASV_280"   "bASV_289"  
##  [61] "bASV_292"   "bASV_30"    "bASV_31"    "bASV_319"   "bASV_32"   
##  [66] "bASV_333"   "bASV_337"   "bASV_339"   "bASV_347"   "bASV_36"   
##  [71] "bASV_369"   "bASV_37"    "bASV_38"    "bASV_39"    "bASV_395"  
##  [76] "bASV_40"    "bASV_406"   "bASV_424"   "bASV_4245"  "bASV_428"  
##  [81] "bASV_439"   "bASV_442"   "bASV_458"   "bASV_46"    "bASV_48"   
##  [86] "bASV_493"   "bASV_50"    "bASV_514"   "bASV_515"   "bASV_517"  
##  [91] "bASV_530"   "bASV_531"   "bASV_557"   "bASV_573"   "bASV_58"   
##  [96] "bASV_59"    "bASV_599"   "bASV_60"    "bASV_605"   "bASV_61"   
## [101] "bASV_622"   "bASV_63"    "bASV_638"   "bASV_64"    "bASV_65"   
## [106] "bASV_66"    "bASV_698"   "bASV_7"     "bASV_709"   "bASV_75"   
## [111] "bASV_76"    "bASV_762"   "bASV_81"    "bASV_83"    "bASV_878"  
## [116] "bASV_892"   "bASV_90"    "bASV_964"   "bASV_98"    "bASV_99"   
## [121] "bASV_153"   "bASV_845"   "bASV_1590"  "bASV_2049"  "bASV_889"  
## [126] "bASV_6729"  "bASV_10033" "bASV_11577" "bASV_12011" "bASV_22991"
## [131] "bASV_24722" "bASV_27301" "bASV_35730" "bASV_38129" "bASV_67851"
## [136] "bASV_28909" "bASV_7928" 
## 
## $Bo_M
##   [1] "bASV_18"   "bASV_29"   "bASV_53"   "bASV_66"   "bASV_67"   "bASV_85"  
##   [7] "bASV_143"  "bASV_226"  "bASV_286"  "bASV_291"  "bASV_339"  "bASV_417" 
##  [13] "bASV_418"  "bASV_550"  "bASV_814"  "bASV_999"  "bASV_1018" "bASV_1238"
##  [19] "bASV_1305" "bASV_1464" "bASV_2760" "bASV_2999" "bASV_3347" "bASV_3617"
##  [25] "bASV_102"  "bASV_104"  "bASV_106"  "bASV_110"  "bASV_112"  "bASV_115" 
##  [31] "bASV_118"  "bASV_12"   "bASV_123"  "bASV_1245" "bASV_13"   "bASV_132" 
##  [37] "bASV_134"  "bASV_1346" "bASV_137"  "bASV_139"  "bASV_145"  "bASV_146" 
##  [43] "bASV_15"   "bASV_164"  "bASV_166"  "bASV_170"  "bASV_174"  "bASV_181" 
##  [49] "bASV_195"  "bASV_196"  "bASV_203"  "bASV_209"  "bASV_214"  "bASV_223" 
##  [55] "bASV_230"  "bASV_244"  "bASV_251"  "bASV_255"  "bASV_261"  "bASV_266" 
##  [61] "bASV_282"  "bASV_309"  "bASV_310"  "bASV_321"  "bASV_326"  "bASV_345" 
##  [67] "bASV_349"  "bASV_355"  "bASV_358"  "bASV_376"  "bASV_404"  "bASV_439" 
##  [73] "bASV_45"   "bASV_474"  "bASV_48"   "bASV_481"  "bASV_542"  "bASV_55"  
##  [79] "bASV_58"   "bASV_590"  "bASV_591"  "bASV_613"  "bASV_62"   "bASV_64"  
##  [85] "bASV_73"   "bASV_731"  "bASV_754"  "bASV_763"  "bASV_780"  "bASV_793" 
##  [91] "bASV_839"  "bASV_91"   "bASV_97"   "bASV_98"   "bASV_136"  "bASV_320" 
##  [97] "bASV_1185" "bASV_278"  "bASV_594"  "bASV_1112" "bASV_316"  "bASV_664" 
## [103] "bASV_775" 
## 
## $It
##   [1] "bASV_11"    "bASV_46"    "bASV_54"    "bASV_55"    "bASV_58"   
##   [6] "bASV_85"    "bASV_102"   "bASV_109"   "bASV_129"   "bASV_190"  
##  [11] "bASV_216"   "bASV_223"   "bASV_259"   "bASV_314"   "bASV_330"  
##  [16] "bASV_400"   "bASV_520"   "bASV_662"   "bASV_716"   "bASV_729"  
##  [21] "bASV_766"   "bASV_776"   "bASV_838"   "bASV_1134"  "bASV_1232" 
##  [26] "bASV_1469"  "bASV_1846"  "bASV_2793"  "bASV_1019"  "bASV_1039" 
##  [31] "bASV_111"   "bASV_116"   "bASV_12"    "bASV_1228"  "bASV_124"  
##  [36] "bASV_1242"  "bASV_142"   "bASV_149"   "bASV_15"    "bASV_1532" 
##  [41] "bASV_154"   "bASV_160"   "bASV_161"   "bASV_162"   "bASV_175"  
##  [46] "bASV_180"   "bASV_182"   "bASV_1867"  "bASV_187"   "bASV_1959" 
##  [51] "bASV_20"    "bASV_207"   "bASV_21"    "bASV_215"   "bASV_217"  
##  [56] "bASV_219"   "bASV_230"   "bASV_246"   "bASV_249"   "bASV_25"   
##  [61] "bASV_261"   "bASV_270"   "bASV_274"   "bASV_277"   "bASV_280"  
##  [66] "bASV_312"   "bASV_32"    "bASV_323"   "bASV_350"   "bASV_362"  
##  [71] "bASV_369"   "bASV_372"   "bASV_384"   "bASV_392"   "bASV_40"   
##  [76] "bASV_407"   "bASV_425"   "bASV_438"   "bASV_44"    "bASV_443"  
##  [81] "bASV_45"    "bASV_451"   "bASV_476"   "bASV_477"   "bASV_496"  
##  [86] "bASV_504"   "bASV_508"   "bASV_52"    "bASV_569"   "bASV_589"  
##  [91] "bASV_590"   "bASV_593"   "bASV_601"   "bASV_602"   "bASV_62"   
##  [96] "bASV_682"   "bASV_69"    "bASV_691"   "bASV_705"   "bASV_725"  
## [101] "bASV_76"    "bASV_768"   "bASV_812"   "bASV_840"   "bASV_935"  
## [106] "bASV_946"   "bASV_210"   "bASV_1006"  "bASV_515"   "bASV_2687" 
## [111] "bASV_473"   "bASV_5751"  "bASV_4308"  "bASV_8032"  "bASV_30283"
## [116] "bASV_59443" "bASV_604"   "bASV_1093"  "bASV_394"   "bASV_1002"
# filter your list of phyloliseq objects to only contain the ASVs defined as "importat"
imp_ASV_ps_l<-mapply (function (list_1,list_2) #mapply will let you manipulate 2 or more lists at once
            prune_taxa(taxa = list_1, x = list_2), #here, x as the name of an argument of prune_taxa, and refers to a phyloseq object
          list_1 = imp_asv_list, # here you define what R object is list_1
          list_2 = ps_l, # here you define what R object is list_2
          SIMPLIFY = FALSE)

# this ps object only has "important" ASVs
imp_ASV_ps_l
## $At
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 114 taxa and 48 samples ]
## sample_data() Sample Data:       [ 48 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 114 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 114 reference sequences ]
## 
## $Bo_M
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 92 taxa and 48 samples ]
## sample_data() Sample Data:       [ 48 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 92 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 92 reference sequences ]
## 
## $It
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 102 taxa and 47 samples ]
## sample_data() Sample Data:       [ 47 samples by 44 sample variables ]
## tax_table()   Taxonomy Table:    [ 102 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 102 reference sequences ]

NOTE that the input is just a list of ASV names. you could use this function to summarize any other analysis method that tags several ASVs as relevant in your analysis

6.2 - define the funciton that calculates fisher tests and heat trees

If we have the time, we explore details for these functions. their description and comments should be fairly sufficiency

the fisher_all_taxa_groups function requires as input: 1) a phyloseq object that contains your “important” taxa 2) a phyloseq objects that contains all your taxa

the fisher_to_heatTree function requires as input 1) the output of the fisher_all_taxa_groups function 2) a phyloseq object that contains your “important” taxa

6.3 - check some of the proportions we will test

is a specific taxonomy overly frequent in the “important” dataset when compared to the taxonomies of the ASV that were NOT defined as important?

note that this phyloseq input is based on the original phyloseq object, that still contains the ranks as p__Proteobacteria, o__Rhizobiales, etc

# how many rhizobiales do we have in full set of 963 ASVs of A. thaliana?
ntaxa(physeq = subset_taxa(ps_l$At, Order == "o__Rhizobiales"))
## [1] 81
# how many rhizobiales do we have in the group of 114 "important" ASVs for A. thaliana?
ntaxa(physeq = subset_taxa(imp_ASV_ps_l$At, Order == "o__Rhizobiales"))
## [1] 18
# how many rhizobiales do we have in the set of 849 A. thaliana ASVs that were not tagged as important ?
unimp_ASV_ps_at<-prune_taxa(!(taxa_names(ps_l$At) %in% imp_asv_list$At), ps_l$At)
ntaxa(physeq = subset_taxa(unimp_ASV_ps_at, Order == "o__Rhizobiales"))
## [1] 63

is 18 in 114 (~15%) similar to 63 in 849 (~7%), considering we have in total 18+63 (81) rhizobiales in 114+849 (983) bASVs? if these proportions are statistically different, rhizobiales may be over represented in the important ASV set.

What are we actually testing here? if the important ASVs are just a random subset of the complete dataset, the proportion of rhizobiales in both datasets should be similar

In other words: the chance that these many rhizobiales were tagged as “important” just because there is a lot of rhizobiales in the full/background dataset is too small.

6.4 - calculate proportions of the fisher test

by running the functions defined in chunk #6.2, we test the proportions stated in chunk # 6.3 - on all taxonomic levels of your input phyloseq object

# run a fisher test comparin these proportions at all taxonomic levels
fisher_output_at<-fisher_all_taxa_groups(ps_important_taxa = imp_ASV_ps_l$At,
                                         ps_all_taxa = ps_l$At)

#check raw ouput of fisher tests for rhizobiales
fisher_output_at[1:3]
## $p__Acidobacteriota
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.8474
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.3153814       Inf
## sample estimates:
## odds ratio 
##  0.7106417 
## 
## 
## $p__Actinobacteriota
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.9985
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.06810393        Inf
## sample estimates:
## odds ratio 
##  0.2590167 
## 
## 
## $p__Bacteroidota
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.09193
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.9037439       Inf
## sample estimates:
## odds ratio 
##   1.557417
fisher_output_at$o__Rhizobiales
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.004991
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  1.345852      Inf
## sample estimates:
## odds ratio 
##   2.283077
# with map() we can easely extract key values from this output, like so:
str(fisher_output_at$o__Rhizobiales) # here we see that p values are the first list element of each taxa ouput
## List of 7
##  $ p.value    : num 0.00499
##  $ conf.int   : num [1:2] 1.35 Inf
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num 2.28
##   ..- attr(*, "names")= chr "odds ratio"
##  $ null.value : Named num 1
##   ..- attr(*, "names")= chr "odds ratio"
##  $ alternative: chr "greater"
##  $ method     : chr "Fisher's Exact Test for Count Data"
##  $ data.name  : chr "matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_im"| __truncated__
##  - attr(*, "class")= chr "htest"
map(fisher_output_at, 1) # here we extarct the first list element of each taxa in the list (p values)
## $p__Acidobacteriota
## [1] 0.8473672
## 
## $p__Actinobacteriota
## [1] 0.9985258
## 
## $p__Bacteroidota
## [1] 0.09193391
## 
## $p__Gemmatimonadota
## [1] 0.8433388
## 
## $p__Myxococcota
## [1] 0.09167539
## 
## $p__Proteobacteria
## [1] 0.1964258
## 
## $c__Acidobacteriae
## [1] 0.4959192
## 
## $c__Alphaproteobacteria
## [1] 0.3283449
## 
## $c__Bacteroidia
## [1] 0.09193391
## 
## $c__Gammaproteobacteria
## [1] 0.1752416
## 
## $c__Gemmatimonadetes
## [1] 0.8197096
## 
## $c__Polyangia
## [1] 0.0578331
## 
## $o__Burkholderiales
## [1] 0.5466162
## 
## $o__Caulobacterales
## [1] 0.3049718
## 
## $o__Chitinophagales
## [1] 0.2735582
## 
## $o__Cytophagales
## [1] 0.2593899
## 
## $o__Flavobacteriales
## [1] 0.0578331
## 
## $o__Gammaproteobacteria_Incertae_Sedis
## [1] 0.0584339
## 
## $o__Gemmatimonadales
## [1] 0.8197096
## 
## $o__Rhizobiales
## [1] 0.004991002
## 
## $o__Sphingomonadales
## [1] 0.9912935
## 
## $o__Xanthomonadales
## [1] 0.09126703
## 
## $f__Caulobacteraceae
## [1] 0.2803993
## 
## $f__Chitinophagaceae
## [1] 0.2554614
## 
## $f__Comamonadaceae
## [1] 0.07230109
## 
## $f__Devosiaceae
## [1] 0.00597422
## 
## $f__Gemmatimonadaceae
## [1] 0.8197096
## 
## $f__Oxalobacteraceae
## [1] 0.3631943
## 
## $f__Rhizobiaceae
## [1] 0.05558943
## 
## $f__Rhodanobacteraceae
## [1] 0.3626101
## 
## $f__Sphingomonadaceae
## [1] 0.9912935
## 
## $f__Unknown_Family
## [1] 0.0584339
## 
## $f__Xanthobacteraceae
## [1] 0.1200347
## 
## $f__Xanthomonadaceae
## [1] 0.1163108
## 
## $g__Acidibacter
## [1] 0.0584339
## 
## $`g__Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium`
## [1] 0.1323728
## 
## $g__Gemmatimonas
## [1] 0.3740159
## 
## $g__Lysobacter
## [1] 0.1097332
## 
## $g__Massilia
## [1] 0.3344271
## 
## $g__Niastella
## [1] 0.01960319
map(fisher_output_at, 3) # here we extarct the third list element of each taxa in the list (odds ratio)
## $p__Acidobacteriota
## odds ratio 
##  0.7106417 
## 
## $p__Actinobacteriota
## odds ratio 
##  0.2590167 
## 
## $p__Bacteroidota
## odds ratio 
##   1.557417 
## 
## $p__Gemmatimonadota
## odds ratio 
##   0.684851 
## 
## $p__Myxococcota
## odds ratio 
##   2.752893 
## 
## $p__Proteobacteria
## odds ratio 
##   1.223502 
## 
## $c__Acidobacteriae
## odds ratio 
##   1.176164 
## 
## $c__Alphaproteobacteria
## odds ratio 
##   1.138922 
## 
## $c__Bacteroidia
## odds ratio 
##   1.557417 
## 
## $c__Gammaproteobacteria
## odds ratio 
##   1.233855 
## 
## $c__Gemmatimonadetes
## odds ratio 
##  0.7134832 
## 
## $c__Polyangia
## odds ratio 
##    3.37105 
## 
## $o__Burkholderiales
## odds ratio 
##  0.9959746 
## 
## $o__Caulobacterales
## odds ratio 
##   1.442557 
## 
## $o__Chitinophagales
## odds ratio 
##   1.366568 
## 
## $o__Cytophagales
## odds ratio 
##   1.876752 
## 
## $o__Flavobacteriales
## odds ratio 
##    3.37105 
## 
## $o__Gammaproteobacteria_Incertae_Sedis
## odds ratio 
##   4.534428 
## 
## $o__Gemmatimonadales
## odds ratio 
##  0.7134832 
## 
## $o__Rhizobiales
## odds ratio 
##   2.283077 
## 
## $o__Sphingomonadales
## odds ratio 
##  0.3304971 
## 
## $o__Xanthomonadales
## odds ratio 
##   1.718467 
## 
## $f__Caulobacteraceae
## odds ratio 
##    1.50201 
## 
## $f__Chitinophagaceae
## odds ratio 
##   1.400058 
## 
## $f__Comamonadaceae
## odds ratio 
##   1.602046 
## 
## $f__Devosiaceae
## odds ratio 
##   22.69165 
## 
## $f__Gemmatimonadaceae
## odds ratio 
##  0.7134832 
## 
## $f__Oxalobacteraceae
## odds ratio 
##   1.158663 
## 
## $f__Rhizobiaceae
## odds ratio 
##   2.542972 
## 
## $f__Rhodanobacteraceae
## odds ratio 
##   1.496563 
## 
## $f__Sphingomonadaceae
## odds ratio 
##  0.3304971 
## 
## $f__Unknown_Family
## odds ratio 
##   4.534428 
## 
## $f__Xanthobacteraceae
## odds ratio 
##   1.978932 
## 
## $f__Xanthomonadaceae
## odds ratio 
##   1.790011 
## 
## $g__Acidibacter
## odds ratio 
##   4.534428 
## 
## $`g__Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium`
## odds ratio 
##   2.826685 
## 
## $g__Gemmatimonas
## odds ratio 
##    1.35984 
## 
## $g__Lysobacter
## odds ratio 
##   2.229048 
## 
## $g__Massilia
## odds ratio 
##   1.247307 
## 
## $g__Niastella
## odds ratio 
##   2.805325

6.5 - calculate proportions of the fisher test

this functions will now take p values and oods ratio out of the fisher test output and into the heat tree

# extract p values and odds ration from the fisher tests and put it into the heat tree
fisher_to_heatTree(fisher_output = fisher_output_at, 
                   ps_important_taxa = imp_ASV_ps_l$At)

What are we seeing here? each tree shows the full taxonomy of all ASVs that were “important” because they are: 1) differentially abundant in stress treatment pairwise comparison, OR; 2) predictors of stress treatments in a random forest, OR; 3) defined as keystone taxa, module hubs or module connectors in a network

The colored taxa shows adjusted p-values from the fisher test (p >0.15 is turned to grey)

By comparing the full taxonomies of these important ASVs with the the full taxonomies of all ASVs that were NOT “important”, we see that o__Rhizobiales and f__Devosiaceae are much too frequent in the “important” ASV dataset. This is likely because they consistently have major roles in stress response and community structure, wich was determined by these 3 analysis methods. note that for this summary we don’t consider ASV abundance (number of reads/sequences of each ASV) as it was already accounted for in the other 3 methods.

6.6 - make one fisher summary heat tree per species

now, run these custom functions over lists of phyloseq objects separated by species with mapply

# run the custom function voer 2 lists of philoseq objects, one with imporntat taxa and other with the full taxa (for every partition)
fisher_result_l<-mapply(function (list_1,list_2)
                   fisher_all_taxa_groups(ps_important_taxa = list_1, 
                                          ps_all_taxa = list_2),
                   list_1 = imp_ASV_ps_l,
                   list_2 = ps_l,
                   SIMPLIFY = FALSE)


#run the custom function on a list of 3 species
fisher_summaries_3l<-mapply(function(list_1,list_2)
    fisher_to_heatTree(fisher_output = list_1,
                       ps_important_taxa = list_2),
    list_1 = fisher_result_l,
    list_2 = imp_ASV_ps_l,
    SIMPLIFY = FALSE)

6.7 - Check summary output

To simply the evaluation, let’s look into the output species by species

# summary in A. thaliana
fisher_summaries_3l$At

# summary in B oleraceae
fisher_summaries_3l$Bo_M

# summary in I. tinctoria
fisher_summaries_3l$It

What is the output we have here? let’s word it in different ways

In A. thaliana, o__Rhizobiales and f__Devosiaceae are found too often in the importnat ASV subset

In B oleraceae, the proportion of o__Rhizobiales and f__Beijerinckiaceae is significantly higher in the important ASV dataset when considering the full ASV dataset

Do you see that branch from o__Streptomycetales on in I. tinctoria? considering the full dataset, there are too many representatives of that branch in the imporant ASV subset

These summary trees were not particularly highlighted as significant because the base operations to calculate differential abundance, random forest, and networks, was based on a much larger dataset with 70k ASVs, and I did not have the time to re-run it for this 1k ASV dataset.

That’s it, thank you for your time!

I hope you found the workshop insightful and that you will find the code useful as a template you can improve. feel free to share this with collegues. as you know, you will need many microbiome tutorials for a complete, updated analysis.

BONUS FOR THE CURIOUS

Are you curious about all the data wrangling of these 2 custom functions? here you can execute a hard-coded version of the function, where you can visualize each object step by step.